graph LR
A["ScreenSpot<br/>Cropped screenshots<br/>Target: 2.01% of image"] --> B["Too easy<br/>for frontier models"]
B --> C["ScreenSpot-Pro<br/>Full-screen, high-res<br/>Target: 0.07% of image"]
C --> D["Tests real-world<br/>professional GUI<br/>grounding"]
style A fill:#e74c3c,stroke:#333,color:#fff
style B fill:#f39c12,stroke:#333,color:#fff
style C fill:#27ae60,stroke:#333,color:#fff
style D fill:#3498db,stroke:#333,color:#fff
ScreenSpot-Pro
A GUI grounding benchmark for professional high-resolution computer use — testing whether AI can locate tiny UI elements across 23 applications and 5 industries
Keywords: ScreenSpot-Pro, GUI grounding, GUI agent, screen understanding, UI element localization, high-resolution, professional software, multimodal LLM, computer use, visual grounding, Photoshop, AutoCAD, VSCode, MLLM benchmark, ScreenSeekeR

Introduction
GUI agents — AI systems that can operate computer interfaces on behalf of users — represent one of the most ambitious frontiers in AI. But while models have made progress on simple tasks like web browsing and mobile navigation, they collapse on professional software. The dense toolbars, tiny icons, and high-resolution multi-panel layouts of applications like Photoshop, AutoCAD, MATLAB, and Visual Studio Code remain far beyond their reach.
ScreenSpot-Pro quantifies this gap. It is a GUI grounding benchmark featuring 1,581 expert-annotated tasks across 23 professional applications, 5 industries, and 3 operating systems — all captured at authentic high resolutions. The challenge: given a natural language instruction and a full-screen screenshot, locate the exact UI element to click. Targets occupy only 0.07% of the screen area on average — 29× smaller than the original ScreenSpot benchmark.
“Existing GUI grounding models perform poorly on this dataset, with the best model achieving only 18.9%.” — ScreenSpot-Pro Paper
What Is ScreenSpot-Pro?
ScreenSpot-Pro is a benchmark that evaluates whether multimodal large language models (MLLMs) can ground natural language instructions to precise UI element locations in high-resolution professional screenshots. Unlike prior benchmarks that used cropped or simplified screenshots, ScreenSpot-Pro uses full, unmodified screen captures from real expert workflows.
Key Characteristics
| Feature | Details |
|---|---|
| Total tasks | 1,581 instructions (each in a unique screenshot) |
| Applications | 23 across 5 professional industries + OS commons |
| Operating systems | Windows, macOS, Linux |
| Resolution | 1920×1080 and above, including dual-monitor setups |
| Target size | 0.07% of image area on average (29× smaller than ScreenSpot) |
| Element types | Text (62.6%) and Icons (37.4%) |
| Annotation | Expert users with 5+ years of experience; dual-reviewer quality control |
| Multilingual | English + Chinese instructions for all tasks |
| License | CC BY 4.0 |
Applications and Industries
ScreenSpot-Pro covers a uniquely diverse range of professional software:
graph TD
SSP["ScreenSpot-Pro<br/>23 Applications"] --> DEV["Development<br/>& Programming"]
SSP --> CRE["Creative<br/>Software"]
SSP --> CAD["CAD &<br/>Engineering"]
SSP --> SCI["Scientific &<br/>Analytical"]
SSP --> OFF["Office<br/>Suite"]
SSP --> OS["Operating System<br/>Commons"]
DEV --> D1["VSCode · PyCharm<br/>Android Studio<br/>Quartus · VMware"]
CRE --> C1["Photoshop · Premiere<br/>Illustrator · Blender<br/>FruitLoops · Unreal Engine<br/>DaVinci Resolve"]
CAD --> CA1["AutoCAD · SolidWorks<br/>Inventor · Vivado"]
SCI --> S1["MATLAB · Origin<br/>Stata · EViews"]
OFF --> O1["Word · PowerPoint<br/>Excel"]
OS --> OS1["Windows 11<br/>macOS · Linux"]
style SSP fill:#e74c3c,color:#fff,stroke:#333
style DEV fill:#3498db,color:#fff,stroke:#333
style CRE fill:#27ae60,color:#fff,stroke:#333
style CAD fill:#f39c12,color:#fff,stroke:#333
style SCI fill:#8e44ad,color:#fff,stroke:#333
style OFF fill:#e67e22,color:#fff,stroke:#333
style OS fill:#6cc3d5,color:#fff,stroke:#333
What Makes It So Hard?
The core difficulty comes from three compounding factors:
- Professional complexity — applications like AutoCAD and MATLAB have hundreds of densely packed buttons, menus, and panels
- High resolution, tiny targets — at full-screen resolution, the target element averages only 0.07% of the image area
- Specialized icons — professional tools use domain-specific icons that are rarely seen in web training data
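To put the 0.07% figure in perspective, a quick back-of-the-envelope calculation shows how small such a target is in pixels. This sketch assumes a standard 1920×1080 frame for illustration; actual ScreenSpot-Pro captures are often larger, which makes the target even harder to resolve:

```python
# Average target area on ScreenSpot-Pro: 0.07% of the image.
# Assuming a 1920x1080 frame for illustration only.
width, height = 1920, 1080
target_area_px = 0.0007 * width * height   # ~1452 square pixels
side_px = target_area_px ** 0.5            # side of an equivalent square
print(f"{target_area_px:.0f} px^2, roughly {side_px:.0f}x{side_px:.0f} px")
```

That is roughly a 38×38 px square on a full-HD screen, about the size of a single toolbar icon, which must be found anywhere in the frame.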
In the original paper, GPT-4o scored only 0.8% on direct grounding — barely above random chance. Even the best specialist model (OS-Atlas-7B) achieved just 18.9%.
Who Built It?
ScreenSpot-Pro was developed by researchers at the National University of Singapore (NUS), East China Normal University, and Hong Kong Baptist University:
- Kaixin Li, Zhiyong Huang, Tat-Seng Chua — National University of Singapore
- Ziyang Meng — East China Normal University
- Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma — Hong Kong Baptist University
The benchmark was published at the Workshop on Reasoning and Planning for Large Language Models (2025).
| Resource | Link |
|---|---|
| arXiv paper | arxiv.org/abs/2504.07981 |
| Leaderboard | gui-agent.github.io/grounding-leaderboard |
| GitHub | github.com/likaixin2000/ScreenSpot-Pro-GUI-Grounding |
What Skills Does It Test?
ScreenSpot-Pro evaluates a very specific but critical capability: GUI visual grounding — the ability to translate a natural language instruction into a precise screen coordinate.
| Capability | What It Tests |
|---|---|
| High-resolution perception | Processing screenshots at >1080p without losing detail |
| Tiny element localization | Finding targets occupying 0.07% of the screen area |
| Professional domain knowledge | Understanding industry-specific UI patterns (toolbars, panels, menus) |
| Icon comprehension | Recognizing specialized icons (e.g., blend modes in Photoshop, circuit symbols in Vivado) |
| Cross-platform understanding | Working across Windows, macOS, and Linux interfaces |
| Bilingual instruction following | Grounding from both English and Chinese instructions |
Example Tasks
Tasks range from straightforward to highly specialized:
- “Refresh the file explorer” — VSCode (icon target)
- “Unlink audio and video” — Premiere (text target in a context menu)
- “Change the coordinate mode of the object” — Blender (icon target in a dense toolbar)
- “Select the SM1.smf file in Quartus window” — Quartus (text target in a file browser)
- “Disable masking” — Origin (tiny icon in a crowded toolbar)
Current Leaderboard
The leaderboard below shows model accuracy on ScreenSpot-Pro. The metric is click accuracy: whether the model’s predicted click point falls within the annotated ground-truth bounding box.
Source: ScreenSpot-Pro Leaderboard (consulted March 29, 2026). Last updated November 17, 2025. Results collected using greedy decoding; micro-average numbers reported.
Top 20 Models
| Rank | Model | Overall (%) |
|---|---|---|
| 1 | KV-Ground-GuiOwl1.5-0315-8B-ZoomIn | 80.5 |
| 2 | Holo2-235B-A22B (Agentic) | 78.5 |
| 3 | MAI-UI-32B (MVP) | 77.5 |
| 4 | KV-Ground-GuiOwl1.5-4B-0228-ZoomIn | 76.4 |
| 5 | Holo2-30B-A3B (Agentic) | 75.2 |
| 6 | MVP_Qwen3VL-32B | 74.1 |
| 7 | MAI-UI-32B (Zoom In) | 73.5 |
| 8 | KV-Ground-GuiOwl1.5-0315-8B | 73.2 |
| 9 | MAI-UI-8B (Zoom In) | 71.9 |
| 10 | Holo2-8B (Agentic) | 71.4 |
| 11 | AdaZoom-GUI-Refine | 71.3 |
| 12 | Holo2-235B-A22B | 70.6 |
| 13 | KV-Ground-Qwen3VL-4B-ZoomIn | 70.3 |
| 14 | UI-Venus-1-5-30B-A3B | 69.6 |
| 15 | Holo2-4B (Agentic) | 68.6 |
| 16 | UI-Venus-1-5-8B | 68.4 |
| 17 | MAI-UI-32B | 67.9 |
| 18 | KV-Ground-GuiOwl1.5-0228-4B | 67.0 |
| 19 | Holo2-30B-A3B | 66.1 |
| 20 | MAI-UI-8B | 65.7 |
Notable General-Purpose Models
| Rank | Model | Overall (%) |
|---|---|---|
| 41 | Qwen2.5-VL-72B-Instruct | 53.3 |
| 49 | Qwen2.5-VL-32B-Instruct | 48.0 |
| 56 | UI-TARS-72B | 38.1 |
| 70 | GPT5-minimal (resized) | 18.5 |
| 71 | Claude (Computer Use) | 17.1 |
| 83 | GPT-4o | 0.8 |
Key takeaways:
- The best model (KV-Ground-GuiOwl1.5-8B with ZoomIn) achieves 80.5% — a massive leap from the original paper’s best of 18.9%, driven by visual search strategies that narrow the search area
- Agentic / multi-round methods dominate the top ranks — models that zoom into candidate regions outperform single-pass approaches
- General-purpose VLMs (GPT-4o, Claude Computer Use) still struggle severely on direct grounding in professional high-res environments
- Even GPT-5 in minimal mode reaches only 18.5% when images are simply resized
Where to Explore the Benchmark
Dashboards and Resources
| Resource | Description | Link |
|---|---|---|
| Official Leaderboard | Live leaderboard with per-application breakdown across all 23 applications | gui-agent.github.io/grounding-leaderboard |
| GitHub Repository | Evaluation code, configs, and inference scripts | github.com/likaixin2000/ScreenSpot-Pro-GUI-Grounding |
| Hugging Face Dataset | The 1,581-task dataset with screenshots and annotations | huggingface.co/datasets/likaixin/ScreenSpot-Pro |
| arXiv Paper | Full technical paper with methodology and analysis | arxiv.org/abs/2504.07981 |
Load the Dataset
from datasets import load_dataset
dataset = load_dataset("likaixin/ScreenSpot-Pro")
print(f"Number of tasks: {len(dataset['test'])}")
# Number of tasks: 1581
Understanding the Metrics
Click Accuracy
The primary metric is straightforward: given a model’s predicted click point (x, y), does it fall inside the annotated ground-truth bounding box? For models that output bounding boxes instead of points, the center of the predicted box is used.
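This scoring rule can be written in a few lines. A minimal sketch follows; the names `click_hit` and `box_center` are illustrative, not taken from the official evaluation code:

```python
def box_center(box):
    """Center point of an (x1, y1, x2, y2) bounding box, used when a
    model outputs a box rather than a click point."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def click_hit(point, gt_box):
    """True if the predicted click point falls inside the annotated
    ground-truth box, i.e. the prediction counts as correct."""
    x, y = point
    x1, y1, x2, y2 = gt_box
    return x1 <= x <= x2 and y1 <= y <= y2

# A box prediction is reduced to its center before scoring:
pred_box = (100, 40, 140, 60)
gt = (90, 35, 150, 65)
print(click_hit(box_center(pred_box), gt))  # the center (120.0, 50.0) lands inside gt
```

Note that accuracy is binary per task: a click one pixel outside the box scores the same as one on the other side of the screen.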
Per-Category Breakdown
The leaderboard reports accuracy per application, which reveals where models excel vs. struggle:
| Category | Challenge Level | Why |
|---|---|---|
| Office Suite | Moderate | Familiar UI patterns, well represented in web training data |
| OS Commons | Moderate | Standard system interfaces |
| Development | Hard | Dense code editors, many small icons |
| Creative | Very Hard | Custom UIs, non-standard toolbars |
| CAD & Engineering | Very Hard | Extremely dense, specialized icons |
| Scientific | Very Hard | Domain-specific plots, menus with many entries |
Text vs. Icon Targets
Icons are consistently harder to ground than text elements — models can leverage OCR capabilities for text but must rely on visual understanding for icons. In the original paper, OS-Atlas-7B scored 28.1% on text but only 4.0% on icons.
ScreenSeekeR: The Breakthrough Approach
The paper introduced ScreenSeekeR, an agentic visual search framework that dramatically improves grounding accuracy by narrowing the search area rather than trying to locate elements in the full high-resolution image. This insight — that reducing the search space matters more than increasing model size — proved foundational for the leaderboard leaders.
graph TD
A["Full Screenshot<br/>High resolution"] --> B["Planner (GPT-4o)<br/>Predicts candidate regions"]
B --> C["Score & Filter<br/>Candidate areas"]
C --> D["Crop & Zoom<br/>Into top candidates"]
D --> E["Grounder Model<br/>Locates target in<br/>simplified sub-image"]
E --> F["Verify Result<br/>Planner checks<br/>correctness"]
style A fill:#ecf0f1,color:#333,stroke:#bdc3c7
style B fill:#3498db,color:#fff,stroke:#333
style C fill:#f39c12,color:#fff,stroke:#333
style D fill:#e67e22,color:#fff,stroke:#333
style E fill:#27ae60,color:#fff,stroke:#333
style F fill:#8e44ad,color:#fff,stroke:#333
ScreenSeekeR boosted OS-Atlas-7B from 18.9% to 48.1%, more than a 2.5× improvement, without any additional training. This cascaded zoom-and-search approach inspired many of the top leaderboard methods (ZoomIn, MVP, Agentic variants).
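The cascaded idea can be sketched as a simple loop. In the sketch below, `propose` and `ground` are hypothetical stand-ins for the planner (e.g. GPT-4o) and the grounder model, and the control flow is a simplification of the full ScreenSeekeR pipeline, which also scores, filters, and verifies candidates:

```python
from dataclasses import dataclass

@dataclass
class Region:
    """A candidate screen region (pixel coordinates) with a planner score."""
    x1: int
    y1: int
    x2: int
    y2: int
    score: float = 0.0

def cascaded_search(image_size, instruction, propose, ground, max_rounds=3):
    """Narrow the search region round by round, then ground in the crop.

    `propose(region, instruction)` -> list[Region]: planner's candidate
    sub-regions; `ground(region, instruction)` -> (x, y): grounder's click
    point inside the final, much smaller region.
    """
    region = Region(0, 0, *image_size)
    for _ in range(max_rounds):
        candidates = propose(region, instruction)
        if not candidates:
            break  # planner cannot narrow further; ground in current region
        region = max(candidates, key=lambda r: r.score)  # zoom into best candidate
    return ground(region, instruction)

# Toy stand-ins: this "planner" halves the region width each round and the
# "grounder" clicks the region center. A real system calls MLLMs here.
def toy_propose(region, instruction):
    if region.x2 - region.x1 <= 100:
        return []
    mid = (region.x1 + region.x2) // 2
    return [Region(region.x1, region.y1, mid, region.y2, score=1.0)]

def toy_ground(region, instruction):
    return ((region.x1 + region.x2) // 2, (region.y1 + region.y2) // 2)

print(cascaded_search((1920, 1080), "Disable masking", toy_propose, toy_ground))
```

Each round shrinks the area the grounder must handle, so the final localization runs on a crop in which the target occupies a far larger fraction of the image than 0.07%.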
Why ScreenSpot-Pro Matters
graph LR
A["GUI agents need<br/>professional software<br/>capabilities"] --> B["Existing benchmarks<br/>too simple"]
B --> C["ScreenSpot-Pro<br/>fills the gap"]
C --> D["Better GUI agents<br/>for real productivity"]
A2["High-res screens<br/>tiny UI targets"] --> B2["Models fail at<br/>precise localization"]
B2 --> C
C --> D2["Focus on<br/>visual search<br/>strategies"]
style A fill:#e74c3c,color:#fff,stroke:#333
style A2 fill:#e74c3c,color:#fff,stroke:#333
style C fill:#27ae60,color:#fff,stroke:#333
style D fill:#3498db,color:#fff,stroke:#333
style D2 fill:#3498db,color:#fff,stroke:#333
- Tests what matters for real productivity — Professional software is where GUI agents could deliver the most value, yet it’s the hardest environment
- Exposes the resolution bottleneck — Models that work on cropped screenshots fail catastrophically at full-screen resolution
- Validates visual search — The massive gap between single-pass (18.9%) and agentic zoom approaches (80.5%) proves that search strategy is critical
- Diverse and authentic — 23 applications across 5 industries, annotated by domain experts during real workflows
- Active community — 84 model submissions on the leaderboard and growing
Conclusion
ScreenSpot-Pro reveals a critical truth about AI GUI agents:
- 1,581 expert-annotated tasks across 23 professional applications — from Photoshop and AutoCAD to MATLAB and Blender
- Targets occupy only 0.07% of the screen — 29× smaller than the original ScreenSpot benchmark
- General-purpose models like GPT-4o score < 1% on direct grounding in professional environments
- Visual search strategies (zoom-and-crop) are the key breakthrough, with the best agentic methods reaching 80.5%
- The gap between single-pass (18.9%) and multi-round approaches (80.5%) proves that the problem is not just about better models but about smarter search
As GUI agents evolve from web browsing toys into serious productivity tools, ScreenSpot-Pro provides the benchmark that measures whether they can handle the software that professionals actually use.
References
- Li, K., Meng, Z., Lin, H., Luo, Z., Tian, Y., Ma, J., Huang, Z., Chua, T.-S. “ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use.” Workshop on Reasoning and Planning for Large Language Models, 2025. arxiv.org/abs/2504.07981
- Li, K. et al. “ScreenSpot-Pro Leaderboard.” gui-agent.github.io/grounding-leaderboard (consulted March 29, 2026)
- Li, K. et al. “ScreenSpot-Pro Dataset.” Hugging Face. huggingface.co/datasets/likaixin/ScreenSpot-Pro
- Li, K. et al. “ScreenSpot-Pro GitHub Repository.” github.com/likaixin2000/ScreenSpot-Pro-GUI-Grounding
Read More
- Explore how models handle document understanding — see OmniDocBench 1.5
- Evaluate multimodal models on college-level visual reasoning — see MMMU-Pro
- Understand expert-level AI evaluation — see Humanity’s Last Exam
- Deploy models for running your own evaluations — see Deploying and Serving LLM with vLLM